attention score matrix

SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models

Guo, Suhan, Deng, Jiahong, Yi, Mengjun, Shen, Furao, Zhao, Jian

arXiv.org Artificial Intelligence

Attention-based architectures have achieved superior performance in multivariate time series forecasting but are computationally expensive. Techniques such as patching and adaptive masking have been developed to reduce their sizes and latencies. In this work, we propose a structured pruning method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for $\textbf{At}$tention), which selectively removes redundant attention mechanisms and yields highly effective models. Different from previous approaches, SPAT aims to remove the entire attention module, which reduces the risk of overfitting and enables speed-up without requiring specialized hardware. We propose a dynamic sensitivity metric, $\textbf{S}$ensitivity $\textbf{E}$nhanced $\textbf{N}$ormalized $\textbf{D}$ispersion (SEND), that measures the importance of each attention module during the pre-training phase. Experiments on multivariate datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs. Furthermore, SPAT-pruned models outperform existing lightweight, Mamba-based, and LLM-based SOTA methods in both standard and zero-shot inference, highlighting the importance of retaining only the most effective attention mechanisms. Our code is publicly available at https://anonymous.4open.science/r/SPAT-6042.
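The abstract does not specify the SEND formula, but the overall pruning recipe (score each attention module by a dispersion-based importance metric, then remove the lowest-scoring modules wholesale) can be sketched as follows. The variance-based `dispersion_score` is an illustrative stand-in for SEND, not the paper's metric:

```python
import numpy as np

def dispersion_score(attn_weights):
    """Toy importance score for one attention module: normalized
    dispersion of its attention weights. The paper's SEND metric is
    not specified in the abstract; this variance-based proxy is an
    assumption for illustration only."""
    w = np.asarray(attn_weights, dtype=float)
    return w.var() / (w.mean() ** 2 + 1e-8)

def select_modules_to_prune(score_per_module, prune_ratio=0.5):
    """Structured pruning step: return indices of the lowest-scoring
    attention modules, which are then removed entirely (no sparse
    masks, so no specialized hardware is needed for speed-up)."""
    scores = np.asarray(score_per_module, dtype=float)
    n_prune = int(len(scores) * prune_ratio)
    order = np.argsort(scores)  # ascending: least important first
    return sorted(order[:n_prune].tolist())
```

Because whole modules are dropped rather than individual weights, the remaining network is a plain dense model, which is what allows latency gains on commodity hardware.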


EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching

Chen, Xinwang, Liu, Ning, Zhu, Yichen, Feng, Feifei, Tang, Jian

arXiv.org Artificial Intelligence

Transformer-based Diffusion Probabilistic Models (DPMs) have shown more potential than CNN-based DPMs, yet their extensive computational requirements hinder widespread practical application. To reduce the computation budget of transformer-based DPMs, this work proposes the Efficient Diffusion Transformer (EDT) framework. The framework includes a lightweight diffusion model architecture and a training-free Attention Modulation Matrix, whose alternating arrangement within EDT is inspired by human-like sketching. Additionally, we propose a token relation-enhanced masking training strategy tailored explicitly for EDT to augment its token relation learning capability. Our extensive experiments demonstrate the efficacy of EDT. The EDT framework reduces training and inference costs and surpasses existing transformer-based diffusion models in image synthesis performance, thereby achieving a significant overall enhancement. With lower FID, EDT-S, EDT-B, and EDT-XL attained speed-ups of 3.93x, 2.84x, and 1.92x respectively in the training phase, and 2.29x, 2.29x, and 2.22x respectively in inference, compared to the corresponding sizes of MDTv2. The source code is released here.
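A training-free modulation matrix can be understood as a fixed elementwise reweighting of the raw attention scores before the softmax. The concrete matrix EDT uses is not given in the abstract; the locality-emphasizing Gaussian below is a hypothetical example of such a matrix, loosely echoing the coarse-to-fine focus of sketching:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulated_attention(scores, modulation):
    """Apply a training-free modulation matrix to raw attention scores
    before the softmax. Any (n, n) weighting can be plugged in; EDT's
    actual matrix and placement may differ from this sketch."""
    return softmax(scores * modulation, axis=-1)

def locality_modulation(n, sigma=2.0):
    """Hypothetical example modulation: damp long-range token pairs so
    attention concentrates on nearby tokens."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.exp(-(dist ** 2) / (2 * sigma ** 2))
```

Since the modulation is fixed rather than learned, it adds no parameters and no training cost, which is the sense in which it is "training-free".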


ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering

Nguyen, Nghia Hieu, Quan, Tho Thanh, Nguyen, Ngan Luu-Thuy

arXiv.org Artificial Intelligence

Text-based VQA is a challenging task that requires machines to use scene texts in images to yield the most appropriate answer to a given question. The main challenge of text-based VQA is exploiting the meaning and information of scene texts. Recent studies tackled this challenge by considering the spatial information of scene texts in images via embedding the 2D coordinates of their bounding boxes. In this study, we follow the definition of meaning from linguistics to introduce a novel method that effectively exploits the information of scene texts written in Vietnamese. Experimental results show that our proposed method obtains state-of-the-art results on two large-scale Vietnamese text-based VQA datasets. The implementation can be found at this link.
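The spatial-embedding idea the abstract refers to (encoding each scene text's bounding box so the model can reason about layout) is commonly implemented by normalizing box coordinates into a small feature vector that a linear layer then projects into the model dimension. The exact feature set ViConsFormer uses is not stated here; this is the generic construction:

```python
import numpy as np

def box_coordinate_features(boxes, image_w, image_h):
    """Normalize scene-text bounding boxes (x1, y1, x2, y2) into 2D
    spatial features. The six-feature layout (corners plus width and
    height) is a common convention, assumed here for illustration."""
    b = np.asarray(boxes, dtype=float)
    x1 = b[:, 0] / image_w
    y1 = b[:, 1] / image_h
    x2 = b[:, 2] / image_w
    y2 = b[:, 3] / image_h
    w = x2 - x1
    h = y2 - y1
    return np.stack([x1, y1, x2, y2, w, h], axis=1)
```

Normalizing by the image size keeps the features resolution-independent, so the same embedding works across images of different sizes.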


MARE: Multi-Aspect Rationale Extractor on Unsupervised Rationale Extraction

Jiang, Han, Duan, Junwen, Qu, Zhe, Wang, Jianxin

arXiv.org Artificial Intelligence

Unsupervised rationale extraction aims to extract text snippets to support model predictions without explicit rationale annotation. Previous works often encode each aspect independently, which may limit their ability to capture meaningful internal correlations between aspects. While there has been significant work on mitigating spurious correlations, our approach focuses on leveraging beneficial internal correlations to improve multi-aspect rationale extraction. In this paper, we propose a Multi-Aspect Rationale Extractor (MARE) to explain and predict multiple aspects simultaneously. Concretely, we propose a Multi-Aspect Multi-Head Attention (MAMHA) mechanism based on hard deletion to encode multiple text chunks simultaneously. Furthermore, multiple special tokens are prepended in front of the text, each corresponding to a specific aspect. Finally, multi-task training is deployed to reduce the training overhead. Experimental results on two unsupervised rationale extraction benchmarks show that MARE achieves state-of-the-art performance. Ablation studies further demonstrate the effectiveness of our method. Our code is available at https://github.com/CSU-NLP-Group/MARE.
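Two of the mechanisms named above have straightforward shapes that can be sketched: prepending one special token per aspect, and "hard deletion", i.e. fully masking non-rationale tokens out of attention instead of down-weighting them. The token-id convention below is an illustrative assumption, not MARE's actual vocabulary layout:

```python
import numpy as np

def prepend_aspect_tokens(token_ids, num_aspects, aspect_token_base=90000):
    """Prepend one special token per aspect in front of the text, one
    per target aspect. The id scheme (aspect_token_base + i) is a
    hypothetical convention for illustration."""
    aspect_tokens = [aspect_token_base + i for i in range(num_aspects)]
    return aspect_tokens + list(token_ids)

def hard_deletion_mask(seq_len, keep_indices):
    """Hard deletion: tokens outside keep_indices receive an additive
    -inf attention mask, removing them from attention entirely rather
    than softly re-weighting them."""
    mask = np.full(seq_len, -np.inf)
    mask[list(keep_indices)] = 0.0
    return mask
```

Adding the mask to raw attention scores before the softmax drives the deleted positions' weights to exactly zero, which is what distinguishes hard deletion from soft selection.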


An alternative formulation of attention pooling function in translation

Conti, Eddie

arXiv.org Artificial Intelligence

The aim of this paper is to present an alternative formulation of the attention scoring function in translation tasks. Generally speaking, language is deeply structured, and this is reflected in the attention score matrix. We exploit this property to define an attention pooling function that takes this aspect into account. In the first chapters, we introduce the attention mechanism in mathematical terms and explain its limitations and alternative formulations. Next, we focus on the experimental session that led to the alternative formulation. Essentially, we guide queries and keys to interact in a specific manner, encoding the distinct roles of attention heads and directing values on where to seek context. In mathematical terms, we can think of this formula as projecting the attention score matrix, say $H$, onto the space of band matrices with fixed bandwidth. This convex subspace is finite-dimensional and therefore closed; as a consequence, the projection onto this space is well-posed and unique. However, at the price of losing the uniqueness of the projection (i.e., of the best approximation of $H$), we define a new space consisting of band matrices plus sparse error matrices. We prove that this is a compact subspace, which guarantees the existence of a matrix that best approximates $H$. We conclude by validating the new formula, namely by calculating how well it approximates the original attention scores. Additionally, we explore the impact of parameters such as $w$ (the context window) and num-pos (the number of relevant words in a sentence). These analyses provide deeper insights into how languages are processed and translated, revealing nuances in the roles of context and word relevance.
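The projection described above is concrete: in the Frobenius norm, the nearest band matrix of bandwidth $w$ to $H$ is obtained by zeroing every entry with $|i - j| > w$. The band-plus-sparse variant below keeps the $k$ largest off-band residual entries as the sparse error term; the paper's exact construction of that term may differ, so treat it as one plausible choice:

```python
import numpy as np

def project_to_band(H, w):
    """Frobenius-norm projection of the attention score matrix H onto
    the (closed, convex) space of band matrices with bandwidth w:
    zero every entry with |i - j| > w."""
    H = np.asarray(H, dtype=float)
    i, j = np.indices(H.shape)
    return np.where(np.abs(i - j) <= w, H, 0.0)

def band_plus_sparse(H, w, k):
    """Band part plus a sparse error term keeping the k largest-magnitude
    off-band entries of H. This is one natural sparse correction; the
    paper's definition of the error matrices may differ."""
    H = np.asarray(H, dtype=float)
    B = project_to_band(H, w)
    R = H - B                       # off-band residual
    S = np.zeros_like(R).ravel()
    if k > 0:
        keep = np.argsort(np.abs(R).ravel())[-k:]
        S[keep] = R.ravel()[keep]   # retain only the k largest residuals
    return B + S.reshape(R.shape)
```

Since entrywise truncation minimizes the Frobenius distance over the band subspace, `project_to_band` really is the unique best band approximation, matching the well-posedness claim in the abstract.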